The dataset I will be using for this analysis is video game data collected from vgchartz.com. This dataset contains a list of video games with sales greater than 100,000 copies. This data includes:
Rank - Overall rank of the game, determined by total sales made
Name - The name of the video game
Platform - The type of system the game is played on
Year - The year the video game was made
Genre - The genre of game
Publisher - The company that made the game
NA_Sales - Number of sales in North America
EU_Sales - Number of sales in Europe
JP_Sales - Number of sales in Japan
Other_Sales - Number of sales in countries other than the above
Global_Sales - The total number of sales
For this project I wanted to know which factors contributed the most to a best-selling video game and why. The objectives of this final project are to a) test hypothesis through analyzing the data and b) portray results new R package, plotly.
After cleaning our data, we’ll take a look at the statistics of our dataset.
glimpse(cvg)
## Rows: 66,392
## Columns: 9
## $ Rank <int> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, ...
## $ Name <chr> "Wii Sports", "Wii Sports", "Wii Sports", "Wii Sports"...
## $ Platform <chr> "Wii", "Wii", "Wii", "Wii", "NES", "NES", "NES", "NES"...
## $ Year <dbl> 2006, 2006, 2006, 2006, 1985, 1985, 1985, 1985, 2008, ...
## $ Genre <chr> "Sports", "Sports", "Sports", "Sports", "Platform", "P...
## $ Publisher <chr> "Nintendo", "Nintendo", "Nintendo", "Nintendo", "Ninte...
## $ Global_Sales <dbl> 82.74, 82.74, 82.74, 82.74, 40.24, 40.24, 40.24, 40.24...
## $ Country <chr> "North America", "Europe", "Japan", "Other", "North Am...
## $ Sales <dbl> 41.49, 29.02, 3.77, 8.46, 29.08, 3.58, 6.81, 0.77, 15....
summary(cvg)
## Rank Name Platform Year
## Min. : 1 Length:66392 Length:66392 Min. :1980
## 1st Qu.: 4151 Class :character Class :character 1st Qu.:2003
## Median : 8300 Mode :character Mode :character Median :2007
## Mean : 8301 Mean :2006
## 3rd Qu.:12450 3rd Qu.:2010
## Max. :16600 Max. :2020
## NA's :1084
## Genre Publisher Global_Sales Country
## Length:66392 Length:66392 Min. : 0.0100 Length:66392
## Class :character Class :character 1st Qu.: 0.0600 Class :character
## Mode :character Mode :character Median : 0.1700 Mode :character
## Mean : 0.5374
## 3rd Qu.: 0.4700
## Max. :82.7400
##
## Sales
## Min. : 0.0000
## 1st Qu.: 0.0000
## Median : 0.0100
## Mean : 0.1343
## 3rd Qu.: 0.0900
## Max. :41.4900
##
skim(cvg)
| Name | cvg |
| Number of rows | 66392 |
| Number of columns | 9 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Name | 0 | 1 | 1 | 132 | 0 | 11493 | 0 |
| Platform | 0 | 1 | 2 | 4 | 0 | 31 | 0 |
| Genre | 0 | 1 | 4 | 12 | 0 | 12 | 0 |
| Publisher | 0 | 1 | 3 | 38 | 0 | 579 | 0 |
| Country | 0 | 1 | 5 | 13 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Rank | 0 | 1.00 | 8300.61 | 4791.75 | 1.00 | 4151.00 | 8300.50 | 12450.00 | 16600.00 | ▇▇▇▇▇ |
| Year | 1084 | 0.98 | 2006.41 | 5.83 | 1980.00 | 2003.00 | 2007.00 | 2010.00 | 2020.00 | ▁▁▃▇▂ |
| Global_Sales | 0 | 1.00 | 0.54 | 1.55 | 0.01 | 0.06 | 0.17 | 0.47 | 82.74 | ▇▁▁▁▁ |
| Sales | 0 | 1.00 | 0.13 | 0.52 | 0.00 | 0.00 | 0.01 | 0.09 | 41.49 | ▇▁▁▁▁ |
cvg %>%
summarize(sum(Global_Sales), mean(Global_Sales))
## # A tibble: 1 x 2
## `sum(Global_Sales)` `mean(Global_Sales)`
## <dbl> <dbl>
## 1 35682. 0.537
unique(cvg$Platform)
## [1] "Wii" "NES" "GB" "DS" "X360" "PS3" "PS2" "SNES" "GBA" "3DS"
## [11] "PS4" "N64" "PS" "XB" "PC" "2600" "PSP" "XOne" "GC" "WiiU"
## [21] "GEN" "DC" "PSV" "SAT" "SCD" "WS" "NG" "TG16" "3DO" "GG"
## [31] "PCFX"
unique(cvg$Year)
## [1] 2006 1985 2008 2009 1996 1989 1984 2005 1999 2007 2010 2013 2004 1990 1988
## [16] 2002 2001 2011 1998 2015 2012 2014 1992 1997 1993 1994 1982 2003 1986 2000
## [31] NA 1995 2016 1991 1981 1987 1980 1983 2020 2017
unique(cvg$Genre)
## [1] "Sports" "Platform" "Racing" "Role-Playing" "Puzzle"
## [6] "Misc" "Shooter" "Simulation" "Action" "Fighting"
## [11] "Adventure" "Strategy"
unique(cvg$Country)
## [1] "North America" "Europe" "Japan" "Other"
By looking at this dataset, it is clear that the rank of the game is based off how well the game did in terms of sales. Sales and Rank are our dependent variables, while Platform, Year, Genre, and Publisher are our independent variables. We want to know which of our independent variables contributes the most to the Global Sales/Rank so that we can make predictions. We will make several graphs examining each variable and its relationship with our dependent variables in order to know what to test.
These two graphs show the correlation between Rank and Global Sales between different countries. The second graph is interactive, courtesy of the Plotly package. By hovering over different coordinates, you can see the exact number of sales in each country , as well as the exact rank.
The positive correlation between Rank and Sales is obvious when looking at this graph. But what the graph also sheds light on is the fact there were the most sales in North America. This could be due to the US being more entertainment-based than Japan or Europe. The population of Japan is smaller than Europe or America so that definitely could be a factor here in this graph. Let’s look at another graph.
This graph shows the most popular genre of video game is Action with 13,264 counts. Sports is a not-so close second place with 9,384 counts. The puzzle genre showed up the least amount compared to the others, with only 2328 counts. These results make sense, as the term “Action” can apply to a large variety of different games. The Puzzle genre seems like it could be quite limited, with less games in production, and this would explain its relatively low count. An explanation for the success of the Sports genre could be that it has games with rights to the NFL which had a huge fanbase already, resulting in a sales increase. Let’s look at another graph.
This graph shows which platforms appear in our data the most. It appears the Nintendo DS and PS2 sold the most compared to the other platforms. The PS2 is already widely regarded as the most successful video game console as it has sold more game copies than any other console. Most of this success is due to the console being the first revolutionary first installment in the world of modern gaming. It also functioned as a DVD player, so its flexible usage appealed to a huge audience.
This graph combines both Platform and Genre. By looking closely at the graph we can see most a large chunk of PS2 sales came from the Sports genre which, as we saw previously, was a popular genre.
This bargraph shows how frequently each year appears on the list. It appears most of the games in the dataset were developed from 2001 to 2011.
This graph shows that the actual sales count was the highest with games made around 1987 to 1989. We can double check the data.
## sum(sal$Global_Sales) mean(sal$Global_Sales)
## 1 47.22 3.148
Our data shows us that 1988 definitely does have a higher sales average than what we calculated the dataset as a whole to have above. I don’t know why this is the case, but hopefully we will unearth some answers using the data further.
This graph is faceted based on the sales in each year in each country. This graph also shows that there were more sales in North America. Like the above graph, North America and Japan show a peak around 1988. Europe shows a slight bump around that time even though we have seen previously that Europe had more sales than Japan. To explain this trend, one explanation could be some games made around that time must’ve been popularized by Japanese and Western culture and sold better.
top <- slice(vg, 1:100)
We have subset the data to include the top 100 ranked games in the datasheet. This will help us make and test predictions for the rest of the dataset.
glimpse(top)
## Rows: 100
## Columns: 11
## $ Rank <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
## $ Name <chr> "Wii Sports", "Super Mario Bros.", "Mario Kart Wii", "...
## $ Platform <chr> "Wii", "NES", "Wii", "Wii", "GB", "GB", "DS", "Wii", "...
## $ Year <dbl> 2006, 1985, 2008, 2009, 1996, 1989, 2006, 2006, 2009, ...
## $ Genre <chr> "Sports", "Platform", "Racing", "Sports", "Role-Playin...
## $ Publisher <chr> "Nintendo", "Nintendo", "Nintendo", "Nintendo", "Ninte...
## $ NA_Sales <dbl> 41.49, 29.08, 15.85, 15.75, 11.27, 23.20, 11.38, 14.03...
## $ EU_Sales <dbl> 29.02, 3.58, 12.88, 11.01, 8.89, 2.26, 9.23, 9.20, 7.0...
## $ JP_Sales <dbl> 3.77, 6.81, 3.79, 3.28, 10.22, 4.22, 6.50, 2.93, 4.70,...
## $ Other_Sales <dbl> 8.46, 0.77, 3.31, 2.96, 1.00, 0.58, 2.90, 2.85, 2.26, ...
## $ Global_Sales <dbl> 82.74, 40.24, 35.82, 33.00, 31.37, 30.26, 30.01, 29.02...
summary(top)
## Rank Name Platform Year
## Min. : 1.00 Length:100 Length:100 Min. :1982
## 1st Qu.: 25.75 Class :character Class :character 1st Qu.:2001
## Median : 50.50 Mode :character Mode :character Median :2007
## Mean : 50.50 Mean :2005
## 3rd Qu.: 75.25 3rd Qu.:2011
## Max. :100.00 Max. :2015
## Genre Publisher NA_Sales EU_Sales
## Length:100 Length:100 Min. : 0.980 Min. : 0.010
## Class :character Class :character 1st Qu.: 3.862 1st Qu.: 2.315
## Mode :character Mode :character Median : 5.675 Median : 3.385
## Mean : 7.050 Mean : 4.207
## 3rd Qu.: 8.290 3rd Qu.: 5.043
## Max. :41.490 Max. :29.020
## JP_Sales Other_Sales Global_Sales
## Min. : 0.000 Min. : 0.080 Min. : 7.340
## 1st Qu.: 0.235 1st Qu.: 0.550 1st Qu.: 8.975
## Median : 1.635 Median : 0.865 Median :11.255
## Mean : 2.046 Mean : 1.318 Mean :14.620
## 3rd Qu.: 3.280 3rd Qu.: 1.620 3rd Qu.:15.925
## Max. :10.220 Max. :10.570 Max. :82.740
top %>%
summarize(sum(Global_Sales), mean(Global_Sales))
## sum(Global_Sales) mean(Global_Sales)
## 1 1461.98 14.6198
unique(top$Platform)
## [1] "Wii" "NES" "GB" "DS" "X360" "PS3" "PS2" "SNES" "GBA" "3DS"
## [11] "PS4" "N64" "PS" "XB" "PC" "2600" "PSP"
The mean of this subset is significantly higher than the dataset as a whole.
By looking at the subsetted graph, it appears more games are concentrated in the 2010 and 2011 years.
The subset makes plotting these variables against each other easier.
After visualizing the data we are ready to make predictions about our data. We will test the relationship of our independent variables with our dependent variable, Global Sales.
mod1 <- aov(data = top,
formula = Global_Sales ~ Platform + Year + Publisher + Genre)
residuals(mod1)^2 %>% mean() %>% sqrt()
## [1] 7.12078
summary(mod1)
## Df Sum Sq Mean Sq F value Pr(>F)
## Platform 16 2278 142.4 1.797 0.0512 .
## Year 1 367 367.2 4.634 0.0351 *
## Publisher 8 650 81.3 1.026 0.4259
## Genre 10 1408 140.8 1.777 0.0831 .
## Residuals 64 5071 79.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the ANOVA table, we have a slightly significant variable, Year. Platform looks like the next influential, with Genre next. Now we can plot our predictions to see how useful the model is.
pred <- add_predictions(top, mod1)
The model didn’t do too bad in predicting Global Sales. We can try to extend our model to our full dataset.
residuals(mod3)^2 %>% mean() %>% sqrt()
## [1] 1.525477
summary(mod3)
## Df Sum Sq Mean Sq F value Pr(>F)
## Year 1 224 223.54 95.81 <2e-16 ***
## Platform 30 1497 49.91 21.39 <2e-16 ***
## Genre 11 308 28.04 12.02 <2e-16 ***
## Residuals 16284 37994 2.33
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 271 observations deleted due to missingness
predd <- add_predictions(vg, mod3)
When it comes to a larger dataset, the model didn’t do as good a job predicting datapoints. The differences can be seen between the first and second set of graphs. So far it appears that Year is the strongest predictor in what makes a best-selling game. However, since our ANOVA table stated each factor was significant (which is not very helpful), it appears more data is needed to fully answer the question.